Number of rows and columns of the data
## [1] 641138 26
Number of unique users
## [1] 641138 30
Frequency table of the number of certificates:
##
## 0 1 2 3 4 5
## 613463 23966 2974 609 79 47
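A frequency table like the one above could be produced roughly as follows. This is only a sketch on invented toy data, not the original analysis code. The user identifier column `userid_DI` is an assumed name that does not appear in the text. The counts in the output above sum to the number of rows, so the sketch attaches each user's certificate total to every one of that user's rows before tabulating.

```r
# Toy stand-in: one row per (user, course) pair; values are invented.
# userid_DI is an assumed column name for the user identifier.
edxdata <- data.frame(
  userid_DI = c("u1", "u1", "u2", "u3", "u3", "u3"),
  certified = c(1, 0, 0, 1, 1, 0)
)

# Per-user certificate count, repeated on every row of that user.
edxdata$user.certificates <- ave(edxdata$certified, edxdata$userid_DI, FUN = sum)

# Tabulate rows by the certificate count of the user they belong to.
cert.table <- table(edxdata$user.certificates)
print(cert.table)
```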
pass.rate of each course
age among registered users
gender among registered users
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Histogram of access.rate for users with certified==0
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Histogram of access.rate for users with certified==1
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
This data has 26 columns and 641138 rows. Each row is the data for one user taking one course.
We are mainly interested in the backgrounds and involvement of the students who attended each course and earned its certification. The features in this data set that we will focus on are: registered, certified, LoE_DI, YoB, gender, nevents and ndays_act.
The following new variables are created in the original data frame edxdata:
age: the age of the user when taking the course, calculated as 2013 - YoB.
access.period: the difference in days between last_event_DI and start_time_DI.
access.rate: ndays_act divided by access.period. This variable essentially measures how often a user accesses the course.
We also grouped the raw data by several features and created new variables in each resulting data frame.
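The three row-level variables described above can be sketched in base R as follows. The data are invented toy values, and treating a zero-length access.period as NA is an assumption made here to avoid dividing by zero; the original analysis may have handled that case differently.

```r
# Toy stand-in for the raw data; column names follow the text, values are invented.
edxdata <- data.frame(
  YoB           = c(1990, 1985, 1978),
  start_time_DI = as.Date(c("2013-01-01", "2013-02-01", "2013-03-01")),
  last_event_DI = as.Date(c("2013-01-11", "2013-02-21", "2013-03-01")),
  ndays_act     = c(5, 4, 1)
)

# Age at the time of the course, using 2013 as the reference year.
edxdata$age <- 2013 - edxdata$YoB

# Days between the start date and the last recorded event.
edxdata$access.period <- as.numeric(
  difftime(edxdata$last_event_DI, edxdata$start_time_DI, units = "days")
)

# Active days per day of access. A zero-length period is set to NA
# (an assumption of this sketch) instead of producing Inf.
edxdata$access.rate <- ifelse(
  edxdata$access.period > 0,
  edxdata$ndays_act / edxdata$access.period,
  NA_real_
)
```

With this definition, a user who was active on 5 of the 10 days between the start date and the last event gets an access.rate of 0.5.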
One data frame groups and summarises the data by user:
course_taken: number of courses viewed.
total_registered: number of courses registered.
total_explored: number of courses explored.
user.certificates: number of courses certified.
We also grouped and summarised the data by course_id. The following new variables are created:
passed_num: total number of certified users of the course.
explored_num: total number of users who explored the course.
registered_num: total number of users who registered for the course.
total_nforum_posts: total number of posts in the course forum.
pass.rate: the number of certified users divided by the number of registered users.
hangon.rate: the number of explored users divided by the number of registered users.
Some of the courses were offered more than once during 2012 and 2013. Therefore, another data frame is generated by grouping and summarising the raw data by course_code:
t_certified: total number of certified users.
t_viewed: number of users who viewed the course.
t_explored: number of users who explored the course.
t_registered: number of users who registered for the course.
A few preprocessing steps were applied to the raw data, as listed below:
LoE_DI converted into levels.
certified, explored and viewed converted into logical type.
Histogram of certified users by LoE_DI
Histogram of certified users by gender
Distribution of nevents of certified users only
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Boxplot of nevents against course
Boxplot of ndays_act against course
Boxplot of access.period for each course
Analyze average login events of certified users
## Source: local data frame [13 x 3]
##
## course_code mean_nevents med_nevents
## 1 CB22x 2752.8906 2292.0
## 2 CS50x 284.1166 236.5
## 3 ER22x 1387.1113 1338.5
## 4 PH207x 6144.8561 5465.0
## 5 PH278x 1739.2003 1496.0
## 6 6.002x 5353.4035 4481.5
## 7 2.01x 5312.8340 4826.0
## 8 14.73x 4797.0168 4508.5
## 9 3.091x 7269.2455 6377.5
## 10 6.00x 7953.6741 7227.5
## 11 7.00x 5942.5189 5548.0
## 12 8.02x 9678.3005 9054.5
## 13 8.MReV 7108.4949 6568.0
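A summary of this shape (mean and median nevents per course_code, certified users only) can be sketched in base R as shown below. The printed output above looks like a dplyr result, so the original code was probably different; the toy data here are invented and are only meant to illustrate the grouping step.

```r
# Toy stand-in for the raw data; values are invented.
edxdata <- data.frame(
  course_code = c("CS50x", "CS50x", "CS50x", "6.00x", "6.00x"),
  certified   = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  nevents     = c(200, 300, 10, 7000, 8000)
)

# Keep certified users only.
certified.users <- edxdata[edxdata$certified, ]

# Mean and median of nevents within each course_code.
mean_nevents <- tapply(certified.users$nevents, certified.users$course_code, mean)
med_nevents  <- tapply(certified.users$nevents, certified.users$course_code, median)

# Combine both summaries into one data frame, one row per course_code.
nevents.summary <- data.frame(
  course_code  = names(mean_nevents),
  mean_nevents = as.vector(mean_nevents),
  med_nevents  = as.vector(med_nevents)
)
print(nevents.summary)
```

Comparing the mean with the median, as the table above does, gives a quick hint of skew: in every course listed the mean exceeds the median, which is what a long right tail of very active users would produce.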
Distribution of LoE_DI among certified users
Distribution of LoE_DI among all users
ndays_act vs. nevents among users who have explored more than half of the course:
access.rate vs. grade
Plot the distribution of nevents
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Plot the distribution of nevents of certified users only
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Plot the distribution of access.rate of certified users:
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.